Crate kmeans

source ·
Expand description

kmeans - API documentation

Kmeans is a small rust library for the calculation of k-means-clustering.

Design target

It’s main target is high performance / throughput, you will therefore find most of its API-surface rather plain. An example of this is, that samples are given using a raw vector, instead of any high-level arithmetics / matrix crate such as nalgebra or ndarray. For this same reason, the crate is internally using hand-written SIMD operations and (unfortunately) quite a bit of unsafe code, which proved to lead to insane speedups.

Supported variants

K-Means clustering is not one algorithm, but more like a concept describing the outcome. There are differing implementations and variants using differing levels of approximations. For a list of supported variants, have a look at the documentation of KMeans.

Supported centroid initializations

The outcome of each K-Means run depends on the initialization of its clusters. There exist multiple algorithms for this initialization, most of which are based on at least some degree of randomness. For a list of implemented initialization methods, see KMeans.

Supported primitive types

Example

Both: Supported variants and supported centroid initializations can be combined at will. Here is an example showing the default full k-Means implementation, using K-Mean++ initialization:

use kmeans::*;

fn main() {
    let (sample_cnt, sample_dims, k, max_iter) = (20000, 200, 4, 100);
 
    // Generate some random data
    let mut samples = vec![0.0f64;sample_cnt * sample_dims];
    samples.iter_mut().for_each(|v| *v = rand::random());
 
    // Calculate kmeans, using kmean++ as initialization-method
    let kmean = KMeans::new(samples, sample_cnt, sample_dims);
    let result = kmean.kmeans_lloyd(k, max_iter, KMeans::init_kmeanplusplus, &KMeansConfig::default());
 
    println!("Centroids: {:?}", result.centroids);
    println!("Cluster-Assignments: {:?}", result.assignments);
    println!("Error: {}", result.distsum);
}

Example (using the status event callbacks)

use kmeans::*;
 
fn main() {
    let (sample_cnt, sample_dims, k, max_iter) = (20000, 200, 4, 2500);
 
    // Generate some random data
    let mut samples = vec![0.0f64;sample_cnt * sample_dims];
    samples.iter_mut().for_each(|v| *v = rand::random());
 
    let conf = KMeansConfig::build()
	    .init_done(&|_| println!("Initialization completed."))
	    .iteration_done(&|s, nr, new_distsum|
		    println!("Iteration {} - Error: {:.2} -> {:.2} | Improvement: {:.2}",
			    nr, s.distsum, new_distsum, s.distsum - new_distsum))
	    .build();

    // Calculate kmeans, using kmean++ as initialization-method
    let kmean = KMeans::new(samples, sample_cnt, sample_dims);
    let result = kmean.kmeans_minibatch(4, k, max_iter, KMeans::init_random_sample, &conf);
 
    println!("Centroids: {:?}", result.centroids);
    println!("Cluster-Assignments: {:?}", result.assignments);
    println!("Error: {}", result.distsum);
}

Short API-Overview / Description

Entry-point of the library is the KMeans struct. This struct is generic over the underlying primitive type, that should be used for the calculations. To use KMeans, an instance of this struct is created, taking over the sample data into its ownership (and doing some memory-related optimizations).

Note: The input-data has to use the same primitive as the required output-data (distances).

The KMeans struct’s instance-methods represent the supported k-Means variants & implementations. Calling such a method (e.g. KMeans::kmeans_lloyd) on the struct does not mutate it, so multiple runs can be done in parallel (the algorithm itself is already parallellized though). Internally, a new instance of KMeansState is used to store the state (and finally the result) of a K-Means calculation.

All of the instance-methods take multiple arguments. One of which is the chosen centroid initialization method. These initialization-method implementations are static methods within the KMeans struct, which are simply passed in as reference.

Structs

  • Entrypoint of this crate’s API-Surface.
  • This is a structure holding various configuration options for the a k-means calculations, such as the random number generator to use, or a couple of callbacks, that can be set to get status information from a running k-means calculation.
  • This is the internally used data-structure, storing the current state during calculation, as well as the final result, as returned by the API. All mutations are done in this structure, making KMeans immutable, and therefore allowing it to be used in parallel, without having to duplicate the input-data.

Enums

  • Enum with possible abort strategies. These strategies specify when a running iteration (with the k-means calculation) is aborted.

Traits